Abstract

There has been a lot of research on random graph models for large real-world networks such as 
those formed by hyperlinks between web pages in the world wide web. Though largely successful 
qualitatively in capturing their key properties, such models may lack important quantitative 
characteristics of Internet graphs. While preferential attachment random graph models were 
shown to be capable of reflecting the degree distribution of the webgraph, their ability to reflect 
certain aspects of the edge distribution was not yet well studied. 

In this paper, we consider the Buckley-Osthus implementation of preferential attachment 
and its ability to model the web host graph in two aspects. One is the degree distribution that 
we observe to follow the power law, as is often the case for real-world graphs. The other
is the two-dimensional edge distribution, that is, the number of edges between vertices of given
degrees. We fit a single "initial attractiveness" parameter a of the model, first with respect to 
the degree distribution of the web host graph, and then, absolutely independently, with respect 
to the edge distribution. Surprisingly, the values of a we obtain turn out to be nearly the 
same. Therefore the same model with the same value of the parameter a fits two
independent and basic aspects of the web host graph very well. In addition, we demonstrate that other
models completely lack the asymptotic behavior of the edge distribution of the web host graph, 
even when accurately capturing the degree distribution. 

To the best of our knowledge, this is the first attempt to describe, for a real Internet graph,
the distribution of edges between vertices with respect to their degrees.

Keywords: web host graph, preferential attachment, random graph models, Buckley-Osthus 
random graphs, power law degree distribution, assortative mixing, edge distribution with respect 
to vertex degrees 

1 Introduction 

The study of the web as a hyperlink graph yields valuable insight into web algorithms for crawling,
searching, and community discovery [23, 33]. Valid random graph models of the web

*A short version of this paper will be published in the proceedings of CIKM 2012 (see http://www.cikm2012.org/).






provide methods of generating WWW-like graphs that are significantly smaller and simpler than 
real WWW graphs, but yet preserve certain key properties of the hyperlink structure of the web. 
Such artificial graphs could serve as a convenient experimental platform, where new approaches to 
search, indexing, compression can be evaluated. 

Vertices of the webgraph correspond to web pages and edges represent hyperlinks between them. 
Webgraphs have been extensively studied with respect to many quantitative aspects such as degree 
distribution, diameter, number of connected components, macroscopic structure, and assortative 
mixing (e.g., see © E3 OS M, M, EE] ) . 

In this paper, we consider the web host graph. Vertices of this graph are web hosts and edges 
correspond to hyperlinks between their pages. The web host graph is much smaller than the 
webgraph, but is still a very useful resource and an abstraction of the web. For a lot of purposes, 
modern search engines consider hosts (and web sites associated with them) rather than web pages 
as the smallest possible entities in the web. Particularly, the smaller size of the web host graph 
allows for simpler and more efficient link analysis useful for web search related tasks. 

We study the web host graph from two perspectives. 

First, we look at the distribution of degrees of the web host graph vertices. It was shown that 
degrees of the webgraph vertices (much like in many other real-world networks [22]) obey the power
law [31 El [T21 El] . Albert et al. were the first to find the power law in the degree distribution of the 
web pages in the domain *.nd.edu [3]. Not surprisingly, we observe that the degree distribution in
the web host graph also follows the power law. 

Second, we study how edges in the web host graph are distributed between vertices depending 
on degrees of these vertices. To the best of our knowledge, this is the first study of this property 
for Internet graphs. However, in a reduced form, this notion was previously studied under the
name of degree correlation or assortativity (e.g., see [6]). A convenient way of capturing the degree
correlation is by examining the properties of $d_{nn}(d)$, the expected average degree of neighbors of a
random vertex with degree $d$. In real-world networks, one often observes $d_{nn}(d) \sim d^{\delta}$ with some $\delta$,
which is negative for the webgraph (disassortativity) [38] and usually positive for social networks
(assortativity) [36]. We study a more general property of the distribution of edges between vertices
with respect to their degrees, that is, the total number of edges $X(d_1, d_2)$ between pairs of vertices
with degrees $d_1, d_2$. In fact, one can derive both the degree distribution and $d_{nn}(d)$ from the
edge distribution:

$$\#(d) = \frac{1}{d} \sum_{d_1} X(d_1, d), \qquad d_{nn}(d) = \frac{\sum_{d_1} d_1 X(d, d_1)}{\sum_{d_1} X(d, d_1)}.$$

On the other hand, the degree

distribution of a real-world graph does not determine its edge distribution, and therefore the latter 
may be considered as an additional more general aspect of the graph. Indeed, there are assortative
and disassortative real-world graphs with close power-law degree distributions, whose
edge distributions necessarily differ.

There are a number of important random graph models whose features (such as degree distribution,
diameter, etc.) are supposed to be close to those of the real Internet graphs and social
networks. Barabasi and Albert proposed the most well-known approach [4], which was realized in
various preferential attachment models. Several of them have precise mathematical definitions:
the Bollobas and Riordan model [10], the Buckley and Osthus model [14, 20, 21], the Copying
Models [501 [32], Directed Scale-Free Graphs [7] and the general model of Cooper and Frieze p2]
(see also [32]). An extensive review of these models can be found elsewhere (e.g., see [HIE]).

We focus on random graph models that allow for mathematical analysis of their properties and 
thus have to be relatively simple. Bollobas and Riordan were the first to propose a precisely defined 
preferential attachment model, and proved the power law for degree distribution in this model with 
mathematical rigor. It was shown that the number of vertices with degree $d$ decreases as $d^{-\gamma}$ with
$\gamma = 3$ [9]. On the other hand, Barabasi and Albert empirically estimated this exponent in the real
webgraph to be approximately $2.1 \pm 0.1$ [3]. As we show in this paper, the parameter $\gamma$ for the web
host graph is also far from 3. Therefore the model of Bollobas and Riordan is realistic for neither
the webgraph nor the web host graph. The random graph model of Buckley and Osthus, which is
a generalization of the Bollobas-Riordan model, solves this problem. Namely, the Buckley-Osthus
model depends on a parameter $a$ called initial attractiveness. For an integer value of $a$, Buckley
and Osthus proved theoretically that the degree distribution in this model also follows the power
law with the exponent $-2 - a$ [14].

Recently Grechnikov obtained two results concerning the Buckley-Osthus model [26]. First, he 
extended the result from [14] about the degree distribution to an arbitrary positive $a$, not necessarily
integer. Second, he obtained an accurate asymptotic estimate for the edge distribution for growing 
degrees as an explicit formula depending on a as a parameter. 

We expect the Buckley-Osthus model to be a good approximation of the web host graph to 
a certain extent. Relying on both aforementioned theoretical results concerning two different
aspects of the Buckley-Osthus model, we find the best fit of the initial attractiveness parameter $a$
for the real web host graph assuming it is generated in the model. First, we choose the value of
the parameter $a$ so that the exponent in the power law for the real web host graph is close to
$-2 - a$. In a second approach, completely independent from the first one, we estimate $a$ using
the best fit of the formula from [26] for the edge distribution in the Buckley-Osthus model to the
edge distribution actually observed in the web host graph. Surprisingly, in both cases we find
that the model agrees very well with the real graph with the same value of $a \approx 0.3$. In other words,
this very same model with $a \approx 0.3$ accurately approximates two completely different and a priori
independent basic aspects of the web host graph, degree and edge distributions (and therefore also 
assortativity). This is especially impressive as the model itself is very simple and has only a single 
degree of freedom, namely the initial attractiveness parameter a. Note that we were able to describe 
the distribution of edges in a real web host graph only using the model for this graph and properties 
of this model obtained theoretically. Without the model, it would be very hard, if not impossible,
to come up with the asymptotics of the edge distribution.

We compare the Buckley-Osthus model and its ability to describe the real web host graph 
with other random graph models. We focus on the models that generate graphs with the degree 
distribution following the power law. We demonstrate that degrees do not immediately relate 
to edges between vertices given their degrees. Even after being fit to the web host graph with 
respect to degree distribution, the models are not able to capture the edge distribution and in fact 
completely lack the asymptotic behavior of this edge distribution observed in the web host graph 
and in the Buckley-Osthus model. 

To sum up, contributions of this paper are the following: 

• We have studied the distribution of degrees and the distribution of edges between vertices 
given their degrees for the web host graph. 

• Relying on previous theoretical study, we demonstrated that the web host graph corresponds
very well to the Buckley-Osthus random graph model with the initial attractiveness parameter
$a \approx 0.3$ with respect to the degree distribution, the edge distribution and (consequently) the
assortativity. We obtained the same value of the parameter twice by fitting the model with
respect to both independent quantitative aspects: the degree and the edge distributions.

• We generated graphs in the Buckley-Osthus model and empirically examined their theoretically
proved properties. We also showed that some other random graph models
fail to capture the edge distribution of the web host graph even though they may successfully
capture the degree distribution.



To the best of our knowledge, this is the first attempt to empirically validate the Buckley- 
Osthus model on the real web data. Moreover, this is the first attempt to rigorously study the 
distribution of edges between vertices with respect to their degrees for Internet graphs.

The remainder of the paper is organized as follows. Section 2 is a short review on random graph
models. In particular, in Section 2.2 we describe the theoretical properties of the Buckley-Osthus
model critical for our experiments with the web host graph. We describe the experimental part
of the work in Sections 3 and 4. In Section 3 we report the results of the approximation of the
web host graph by the Buckley-Osthus model with respect to degree and edge distributions. In
Section 4 we compare it with approximations by other models. In Section 5 we discuss potential
applications and future work.



2 Random Webgraph Models 

One possible theoretical approach to what the model of a webgraph might be is the mathematical 
concept of random graphs. The essence of this approach is in the idea of a webgraph developing 
stochastically. Once the rules or the parameters of this stochastic process are precisely specified, 
a random graph may obtain (sometimes unexpected) stable properties, in spite of the stochastic 
nature of its formation. Some of such properties may reflect those of the real webgraph rather 
accurately. 

There have been a lot of attempts to model the hyperlink graph of the web as a random graph. 
Probably the simplest (even regardless of the webgraph) are random graphs of the Erdos-Renyi 
model, where a graph is constructed by creating a fixed number of vertices and a fixed number of 
edges drawn independently uniformly at random over pairs of vertices. However, this model is not
suitable for the webgraph (as well as the web host graph) as it is not scale-free, that is, does not
have a power law degree distribution.

In 1999, Barabasi and Albert [4] observed that the degree distribution of the real webgraph
follows the power law with the exponent approximately equal to $-2.1$. They proposed a concept of
preferential attachment that explained the phenomenon. The basic underlying idea is the following. 
A graph is constructed with a random process. At each step of the process, a new vertex is added 
and a fixed number of edges are added from the new vertex to randomly chosen already existing 
vertices. Vertices with higher degree acquire new edges with higher probability that linearly depends 
on their degree ("rich get richer"). 



2.1 Preferential attachment models 

The general idea of preferential attachment obtained a precise mathematical formulation in the 
model of Bollobas and Riordan [9] defined in the following way. We construct a series of graphs
(Markov chain) $G_m^n$, $n = 1, 2, \ldots$, with $n$ vertices and $mn$ edges, where $m \in \mathbb{Z}$ is a fixed number.
Let us consider the case $m = 1$ first. Let $G_1^1$ be a graph consisting of one vertex with a self-loop.
A graph $G_1^t$ is obtained from $G_1^{t-1}$ by adding a vertex $t$ and an edge from $t$ to a vertex $i$, where $i$
is chosen randomly among the existing vertices with the following probability distribution:

$$\mathrm{P}(i = s) = \begin{cases} d_{G_1^{t-1}}(s)/(2t-1) & \text{if } 1 \le s \le t-1, \\ 1/(2t-1) & \text{if } s = t, \end{cases}$$

where $d_{G_1^t}(s)$ denotes the degree of the vertex $s$ in $G_1^t$. A graph $G_m^n$ is constructed from $G_1^{mn}$ by
merging the vertices $1, \ldots, m$ into the vertex 1 of the new graph, merging the vertices $m+1, \ldots, 2m$
into the vertex 2 of the new graph, etc. Note that one can consider the variant of the model with
directed graphs: in this case, an edge between the vertices $i$ and $j$ goes from $i$ to $j$ if $i > j$.

The Bollobas-Riordan model accurately captures some of the key properties of different real-world
graphs. For instance, the "small-world phenomenon" of many real-world networks, i.e., a surprisingly
small diameter, is also observed in the model. Bollobas and Riordan proved that indeed the
diameter of $G_m^n$ is about $\frac{\log n}{\log \log n}$ for large $n$ [10]. They also showed that the degree distribution
of $G_m^n$ obeys the power law: the number of vertices with degree $d$ in the model is well approximated
by $d^{-\gamma}$, with $\gamma = 3$ [9]. However, this disagrees with the webgraph where the estimate
$\gamma_{WWW} = 2.1 \pm 0.1$ was observed [3]. This means that even though the Bollobas-Riordan model
is similar in some aspects to real Internet graphs qualitatively, it needs to be refined to better
capture the reality quantitatively. In this work, we estimate the value of $\gamma_{Host}$ in the power law for
the web host graph to be approximately $2.276 \pm 0.001$ (see Section 3.4).

A possible approach for such a refinement is the model proposed independently by two groups
of researchers [20, 21]. They proposed to extend the model with a parameter called initial attractiveness
of a vertex, a positive constant that does not depend on degree. Later Buckley and Osthus
gave an explicit construction of this model [13]. The degree distribution of a Buckley-Osthus graph
also obeys the power law, but now varying the value of $a$ in the definition of the model, one can
tune the exponent $\gamma$ in the power law of the resulting graph.

More specifically, the model generates a series of graphs $H_{a,m}^n$, $n = 1, 2, \ldots$, with $n$ vertices and
$mn$ edges, where $m \in \mathbb{Z}$ is a fixed number. The definition of $H_{a,1}^n$ recapitulates the definition of
$G_1^n$, and the only difference is that the probability of a newly added edge in $H_{a,1}^t$ equals

$$\mathrm{P}(i = s) = \begin{cases} \dfrac{d_{t-1}(s) + a - 1}{(a+1)t - 1} & \text{if } 1 \le s \le t-1, \\[2mm] \dfrac{a}{(a+1)t - 1} & \text{if } s = t. \end{cases}$$

A graph $H_{a,m}^n$ is obtained from $H_{a,1}^{mn}$ in the same manner as $G_m^n$ from $G_1^{mn}$. Note that for $a = 1$, we
obtain the initial Bollobas-Riordan model $G_m^n$. For an integer $a$, Buckley and Osthus proved [14]
that the degree distribution of a random graph in the model follows the power law with $\gamma = 2 + a$.
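
The construction above can be sketched in a few lines of Python. This is only an illustrative sampler, not the generator used in the experiments: the function name `buckley_osthus_edges` and its arguments are hypothetical, a self-loop is assumed to add 2 to the degree of its vertex (which is what makes the weights sum to $(a+1)t - 1$), and the naive linear scan costs quadratic time overall.

```python
import random

def buckley_osthus_edges(n, m, a, seed=None):
    """Sample H_{a,m}^n as a list of (u, v) pairs with u >= v (loops allowed)."""
    rng = random.Random(seed)
    steps = n * m                      # first build H_{a,1}^{mn}
    deg = [0] * (steps + 1)            # deg[s]: current degree of vertex s (1-based)
    raw_edges = []
    for t in range(1, steps + 1):
        total = (a + 1) * t - 1        # normalizing constant of the distribution
        x = rng.random() * total
        target, acc = t, 0.0           # default: self-loop, which has weight a
        for s in range(1, t):
            acc += deg[s] + a - 1      # weight of an existing vertex s
            if x < acc:
                target = s
                break
        deg[t] += 1                    # a loop therefore contributes 2 to deg[t]
        deg[target] += 1
        raw_edges.append((t, target))
    # merge vertices 1..m into 1, m+1..2m into 2, and so on
    return [((u - 1) // m + 1, (v - 1) // m + 1) for u, v in raw_edges]
```

With `a = 1` the same sketch samples the Bollobas-Riordan graph, since the weights reduce to $d_{t-1}(s)$ and 1.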



The Buckley-Osthus model has not previously been compared with real graphs. In Section 2.2
we present further properties of the Buckley-Osthus model obtained recently and then use them
in Section 3 for comparison of this model with the real web host graph.

2.2 Properties of the Buckley-Osthus Model 

In this section, we present recent theoretical results on degree and edge distributions of the Buckley- 
Osthus random graph model [26]. Our experiments with real graphs (Section 3) are based on the
results from this section. 
2.2.1 Degree distribution



The following theorems show the dependence of the degree distribution in the Buckley-Osthus 
model on the initial attractiveness parameter $a$ and generalize the results of [14].

For a given pair of functions $f$, $g$, we say that $f(n) = O(g(n))$ if there exists a positive real
number $C$ such that $|f(n)| \le C g(n)$ for sufficiently large $n$. We also say that $f(n) = o(g(n))$ if $f$
is dominated by $g$ asymptotically and $f(n) = \omega(g(n))$ if $f$ dominates $g$ asymptotically.

Let $\#_a(d, n)$ be the number of vertices with degree $d$ in the model $H_{a,m}^n$. We denote by $\mathrm{E}X$ the
expectation of a random variable $X$.

Theorem 1 ([26]) For $d \ge m$ and for every fixed positive $a$,

$$\mathrm{E}\,\#_a(d, n) = \frac{B(d - m + ma, a + 2)}{B(ma, a + 1)}\, n + O\!\left(\frac{1}{d}\right).$$

Here $B(x, y)$ is the beta function. Note that

$$\frac{B(d - m + ma, a + 2)}{B(ma, a + 1)} \sim C d^{-2-a}$$

as $d \to \infty$ with $C = (a + 1) \frac{\Gamma(ma + a + 1)}{\Gamma(ma)}$, where $\Gamma(x)$ is the gamma function (an extension of the
factorial function).
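
The asymptotic equivalence can be checked numerically. The sketch below is an illustration only; the function names are hypothetical, and log-gamma values are used so that large degrees do not overflow.

```python
import math

def log_beta(x, y):
    """Logarithm of the beta function B(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)

def expected_degree_fraction(d, m, a):
    """B(d - m + m*a, a + 2) / B(m*a, a + 1), the coefficient of n in Theorem 1."""
    return math.exp(log_beta(d - m + m * a, a + 2) - log_beta(m * a, a + 1))

def power_law_approx(d, m, a):
    """C * d**(-2 - a) with C = (a + 1) * Gamma(m*a + a + 1) / Gamma(m*a)."""
    c = (a + 1) * math.exp(math.lgamma(m * a + a + 1) - math.lgamma(m * a))
    return c * d ** (-2 - a)
```

For $m = 1$, $a = 1$ the first function reduces to $4/(d(d+1)(d+2))$, the familiar expression for the Bollobas-Riordan case, and for large $d$ the ratio of the two functions approaches 1.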

We say that a certain property holds whp (with high probability) if the probability of this
property tends to 1 as $n \to \infty$. The following concentration result shows that the degree distribution
obeys the power law with $\gamma = 2 + a$.

Theorem 2 ([26]) Consider $d \ge m$ to be the value of a function of $n$ and $\psi(n)$ to be a function
tending to infinity arbitrarily slowly. Then whp we have

$$\left| \#_a(d, n) - \frac{B(d - m + ma, a + 2)}{B(ma, a + 1)}\, n \right| \le \left( \sqrt{d^{-a-2} n} + d^{-1} \right) \psi(n).$$

In contrast with the result of [14], $a$ is not necessarily an integer here. Roughly speaking, Theorems
1 and 2 imply that for large $d$, we have

$$\#_a(d, n) \approx b_1 d^{-2-a} n \qquad (1)$$

with some constant $b_1$ in an appropriate range of degrees $d$.



2.2.2 Edges between vertices of given degrees 

In this subsection, we report the results capturing the behaviour of $X_a(d_1, d_2, n)$, the total number
of edges between vertices of degree $d_1$ and vertices of degree $d_2$ in a Buckley-Osthus graph. For
$d_1 = d_2$, we count every edge twice, but we do not count self-loops. We use this function in
Section 3 comparing the web host graph with some random graph models including the Buckley-
Osthus model.

The number of edges between vertices of given degrees in the Buckley-Osthus model can be 
estimated in the following way. 
Theorem 3 ([26]) Consider $d_1, d_2$ to be the values of two functions of $n$ tending to infinity as $n$
grows. Then

$$\mathrm{E} X_a(d_1, d_2, n) = c_a(d_1, d_2)\, n + O(1),$$

where

$$c_a(d_1, d_2) = ma(a + 1) \frac{\Gamma(ma + a + 1)}{\Gamma(ma)} \cdot \frac{(d_1 + d_2)^{1-a}}{d_1^2 d_2^2} \left( 1 + O\!\left( \frac{1}{d_1} + \frac{1}{d_2} + \frac{d_1 d_2}{(d_1 + d_2)^2} \right) \right).$$

The following theorem is a concentration result. 

Theorem 4 ([26]) Let $c > 0$. Then

$\mathrm{P}\left( |X_a(d_1, d_2, n) - \mathrm{E} X_a(d_1, d_2, n)| \ge c (d_1 + d_2) \sqrt{mn} \right) \le$



2exp 



c 2 



In particular, for an arbitrary function $c(n)$ tending to infinity as $n$ grows we have whp
$|X - \mathrm{E}X| < c(n)(d_1 + d_2)\sqrt{mn}$.

Thus it follows that $\mathrm{E} X_a(d_1, d_2, n)$ behaves as

$$(d_1 + d_2)^{1-a} d_1^{-2} d_2^{-2}\, n \qquad (2)$$

if the ratio $\max(d_1, d_2)/\min(d_1, d_2)$ is sufficiently large (otherwise this formula does not capture
the asymptotic behavior).

In Section 3 we also use the fact that the number of loops and multiple edges in the Buckley-
Osthus random graph is considerably smaller than the total number of edges. To be more precise,
the following statement holds.

Proposition 1 For every $0 < a < 1$ we have

$$\mathrm{E} N(\text{loops in } H_{a,m}^n) = O(\ln n),$$

$$\mathrm{E} N(\text{multiple edges in } H_{a,m}^n) = O(n^{1-a}).$$

Proposition 1 is proved in the Appendix. Here we denote by $N(\text{loops in } H_{a,m}^n)$ the number of loops
in $H_{a,m}^n$ and by $N(\text{multiple edges in } H_{a,m}^n)$ the number of multiple edges in the random
graph. Recall that the total number of edges in $H_{a,m}^n$ is $mn$ and dominates both $n^{1-a}$ and $\ln n$.

2.3 Other results related to edge distribution



The distribution of edges between vertices with respect to their degrees is closely related to an 
interesting quantitative characteristic of graphs called assortativity, or degree correlation [HI E21 ESI 
\57\ 158] . Informally, a graph is assortative (has a positive degree correlation) if vertices of high 
degree tend to connect with vertices of high degree. On the other hand, a graph is disassortative
(has a negative degree correlation) if vertices of high degree tend to connect with vertices of low
degree. A convenient way of capturing the degree correlation is by examining the properties of
$d_{nn}(d)$, the average degree of neighbors of a vertex with degree $d$ (first average over neighbors of a
vertex, and then over vertices of degree $d$). In real-world networks, often $d_{nn}(d) \sim d^{\delta}$ with some $\delta$,
which is negative for the webgraph (disassortativity) [38] and usually positive for social networks
(assortativity) [36]. Disassortativity of protein networks was studied in [53].

The function $X(d_1, d_2)$ of edge distribution, the number of edges between vertices of degrees
$d_1, d_2$, may be considered as a generalization of $d_{nn}(d)$. In fact, the latter can be restored from the
former:

$$d_{nn}(d) = \frac{\sum_{d_1} d_1 X(d, d_1)}{\sum_{d_1} X(d, d_1)}.$$
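
To make the relation between the edge distribution, the degree distribution and $d_{nn}(d)$ concrete, here is a small sketch for an undirected simple graph given as an edge list. The function names are hypothetical. Each edge is recorded under both ordered degree pairs, so edges between vertices of equal degree are counted twice, following the convention of Section 2.2.2.

```python
from collections import Counter

def edge_distribution(edges):
    """Return (deg, X): vertex degrees and X[(d1, d2)] for a simple undirected graph."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    X = Counter()
    for u, v in edges:
        X[(deg[u], deg[v])] += 1   # both orders, so equal-degree edges count twice
        X[(deg[v], deg[u])] += 1
    return deg, X

def degree_counts_from_X(X):
    """Recover #(d) = (1/d) * sum over d1 of X(d1, d)."""
    total = Counter()
    for (d1, d), c in X.items():
        total[d] += c
    return {d: total[d] / d for d in total}

def neighbor_degree(X):
    """d_nn(d) = sum_d1 d1 * X(d, d1) / sum_d1 X(d, d1)."""
    num, den = Counter(), Counter()
    for (d, d1), c in X.items():
        num[d] += d1 * c
        den[d] += c
    return {d: num[d] / den[d] for d in den}
```

For a star with three leaves, the centre (degree 3) sees neighbors of average degree 1 and each leaf sees a neighbor of degree 3, a minimal example of disassortative behavior.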

As mentioned in [36] and [38], networks in the Barabasi-Albert model [2] have $d_{nn}(d) \sim \mathrm{const}$
and thus do not demonstrate assortative mixing. However, it can be shown experimentally that a
graph in the Buckley-Osthus model, which is a generalization of the Barabasi-Albert model, may
demonstrate assortativity (for $a > 1$) or disassortativity (for $a < 1$). In particular, we compare the
web host graph and the Buckley-Osthus model with $a \approx 0.3$ with respect to the function $d_{nn}(d)$ in
Section 4 and find them to be close to each other (see Fig. 6).

It was claimed in [35] that the negative degree correlation may be explained by the model 
where a graph is chosen uniformly at random from the set of all graphs with a prescribed power- 
law degree distribution without multiple edges. The authors stated that the resulting graph will 
have a negative degree correlation with high probability, because vertices of high degree are forced to
connect to each other rarely, or otherwise multiple edges will be more likely to appear. The authors of
[37] obtain some theoretical results for a similar model. They also argue that the Internet graph is
disassortative. In [13] the assortative co-authorship graph is modeled. The proposed model is based
on preferential attachment, with an additional idea of adding new links between already existing
vertices chosen based on their degrees. This idea can be utilized for modeling both assortative and
disassortative graphs. However, in contrast with the Buckley-Osthus model, these models are not
based on any natural rules that would explain the underlying laws of graph formation. They are
rather specific and thus may be suspected of "overfitting" when approximating real graphs.



3 Experiments on the web host graph

3.1 Preliminaries

Let us consider a random graph in the Buckley-Osthus model $H_{a,m}^n$. For simplicity, we ignore edge
directions, merge multiple edges and remove loops. Due to Proposition 1, the difference between
the obtained graph $H$ and the initial one is not important for us and Theorems 1, 2, 3, 4 are still
applicable to $H$.

In what follows, we denote by $\#(d)$ the number of vertices of degree $d$ and by $X(d_1, d_2)$ the
total number of edges between all vertices of degree $d_1$ and all vertices of degree $d_2$. The following
two properties follow from equations (1), (2).

1) The function $\#(d)$ is approximated well by

$$b_1 d^{-2-a} \qquad (3)$$

for some constant $b_1$ in an appropriate range of degrees.

2) The number $X(d_1, d_2)$ of edges between vertices of degrees $d_1, d_2$ is approximated well by the
function

$$b_2 (d_1 + d_2)^{1-a} d_1^{-2} d_2^{-2} \qquad (4)$$

for some constant $b_2$ in an appropriate range of degrees.



For the web host graph (undirected, without loops or multiple edges, see Section 3.2 for details)
we define $\#_{Host}(d)$ and $X_{Host}(d_1, d_2)$ in exactly the same manner as for a random graph $H$. Each of
these two functions can be considered as an empirical density function of some distribution. Indeed,
let $\xi$ be the degree of a random vertex and $\psi$ be the ordered pair of degrees of vertices adjacent to
a random edge (here the order of the vertices is also chosen randomly). Then the function $\#_{Host}$ is
the empirical density function of the random quantity $\xi$ and $X_{Host}$ is that of the random vector $\psi$.

It is known that as $d$ grows, the variation of the function $\#_{Host}(d)$ may dominate its mean [17],
see figures in [12]. The same might be true for the function $X_{Host}(d_1, d_2)$ as $d_1$ and $d_2$ grow.
Therefore it is more convenient, in particular less vulnerable to fluctuations of the data, to study
the corresponding distribution functions instead of the density functions.

To that end, we consider the following cumulative functions: 



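$$\tilde{\#}_{Host}(d) = \sum_{j > d} \#_{Host}(j),$$

$$\tilde{X}_{Host}(d_1, d_2) = \sum_{j_1 \ge j_2,\; j_1 > d_{max},\; j_2 > d_{min}} X_{Host}(j_1, j_2), \qquad (5)$$

$$\tilde{p}_{Host}(d_1, d_2) = \frac{\tilde{X}_{Host}(d_1, d_2)}{\tilde{\#}_{Host}(d_1)\, \tilde{\#}_{Host}(d_2)},$$

where $d_{min} = \min\{d_1, d_2\}$, $d_{max} = \max\{d_1, d_2\}$.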
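
A direct, unoptimized sketch of these cumulative characteristics is given below. It assumes the degree counts are stored in a dictionary `count[d]` and the edge distribution in a dictionary `X[(d1, d2)]` keyed by ordered degree pairs; both names, like the function names, are hypothetical. The strict inequalities follow the sums as printed above.

```python
def cumulative_degree_count(count, d):
    """Number of vertices whose degree exceeds d."""
    return sum(c for j, c in count.items() if j > d)

def cumulative_edge_count(X, d1, d2):
    """Sum of X(j1, j2) over j1 >= j2 with j1 > max(d1, d2) and j2 > min(d1, d2)."""
    d_min, d_max = min(d1, d2), max(d1, d2)
    return sum(c for (j1, j2), c in X.items()
               if j1 >= j2 and j1 > d_max and j2 > d_min)

def edge_probability(count, X, d1, d2):
    """Ratio of the cumulative edge count to the product of cumulative degree counts."""
    denom = cumulative_degree_count(count, d1) * cumulative_degree_count(count, d2)
    return cumulative_edge_count(X, d1, d2) / denom if denom else float("nan")
```

Each call scans the whole dictionary, which is fine for an illustration; for a graph of the size considered here one would precompute suffix sums instead.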

The main assumption that we make in our experiments is the following: we assume that the 
web host graph is obtained using a Buckley-Osthus graph model, such as the graph H described 
above. Under this assumption, one can show that the cumulative characteristics of the web host 
graph that we just defined have the following properties. 

1) The cumulative function $\tilde{\#}_{Host}(d)$ is approximated well by

$$f_{a_1, b_1}(d) = b_1 d^{-1-a_1} \qquad (6)$$

for some constants $a_1, b_1$ in an appropriate range of degrees. Note that the exponent in the
power law is reduced by 1 after the integration.
2) The function $\tilde{p}_{Host}(d_1, d_2)$ is approximated well by

$$g_{a_2, b_2}(d_1, d_2) = b_2 (d_1 + d_2)^{1-a_2} d_1^{a_2} d_2^{a_2} \qquad (7)$$

for some constants $a_2, b_2$ in an appropriate range of degrees. Note that the approximating
function does not change after the integration (see Appendix).
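
As a brief consistency check (an illustration, not part of the cited results), the functional form in 2) can be seen at the level of densities by dividing the edge approximation $b_2 (d_1 + d_2)^{1-a} d_1^{-2} d_2^{-2}$ by the product of the degree approximations $b_1 d_1^{-2-a}$ and $b_1 d_2^{-2-a}$, assuming a single parameter $a$. The constant $b$ below is introduced only for this illustration.

```latex
\frac{X(d_1,d_2)}{\#(d_1)\,\#(d_2)}
  \approx \frac{b_2\,(d_1+d_2)^{1-a}\,d_1^{-2}\,d_2^{-2}}
               {b_1 d_1^{-2-a}\cdot b_1 d_2^{-2-a}}
  = b\,(d_1+d_2)^{1-a}\,d_1^{a}\,d_2^{a},
\qquad b = \frac{b_2}{b_1^{2}}.
```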

Recall that under our assumption, we actually have $a_1 = a_2$. However, it is worth mentioning
that the mere fact that some graph satisfies both properties 1) and 2) does not automatically imply
the equality $a_1 = a_2$. We explain it in detail in Section 4.

In our experiments, we estimate the constants $b_1, b_2, a_1, a_2$, with the following results.

1) The values of the approximating functions $f_{a_1, b_1}(d)$ and $g_{a_2, b_2}(d_1, d_2)$ are close enough to
$\tilde{\#}_{Host}(d)$ and $\tilde{p}_{Host}(d_1, d_2)$, respectively, for a sufficiently large range of degrees $d, d_1, d_2$.

2) The estimated values of $a_1$ and $a_2$ are very close to each other, with relative difference only
about 0.5%.



These facts make us believe that our main assumption about the realization of the Buckley-Osthus
random graph with respect to the two quantitative aspects is reasonable. The results are described
in detail in Section 3.4.

We describe and justify our method for estimation of the parameters $a_1, a_2$ in Section 3.3. The
experiments with simulated graphs confirm the validity of this method (Section 4).



3.2 Data 

All experiments are performed with the web host graph crawled in November 2011 by the major 
Russian search engine yandex.ru. The robot is constantly crawling the web, collecting and updating
web pages and links between them. From this data, cleaned from spam and duplicates, a web host 
graph can be constructed in the following way. Vertices of this graph correspond to owners. An 
owner roughly corresponds to all pages downloaded by the robot at least once that belong to 
the same second level domain. (In some cases a second level domain is subdivided into several 
owners. Sometimes different second level domains are merged into a single owner.) An edge 
between two vertices-owners is drawn if there is a link from a page of one owner to a page of 
another owner. For the purposes of our work, we further simplify the graph, making it undirected 
and removing duplicate edges and self-loops. The web host graph constructed in this manner 
consists of 86.8 million vertices and 1.33 billion edges.¹ We do not suspect any bias in the way this
data is collected that may substantially affect our results. 



3.3 Framework for Parameter Estimation 

In Section 3.1 we already explained that the functions $f$ and $g$ defined by Equations (6) and
(7) approximate $\tilde{\#}_{Host}$ and $\tilde{p}_{Host}$ for some appropriate values of the parameters $a_1, b_1$ and $a_2, b_2$,
respectively. In this section we describe the method we use to optimize these parameters in order
to obtain the best possible approximations.

Let $A = \{[\alpha^k] : k \in \mathbb{N}\}$, where $\alpha = 1.01$, and $[\cdot]$ denotes the integer part of a number.
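
The set $A$ is a geometric grid of degrees: it contains every small integer and thins out to roughly 1% spacing for large degrees. A minimal sketch of its construction follows; the function name and the cut-off `d_max` are assumptions made for the illustration.

```python
def degree_grid(alpha=1.01, d_max=10**6):
    """Distinct integer parts of alpha**k for k = 1, 2, ..., up to d_max."""
    grid = []
    k = 1
    while True:
        value = int(alpha ** k)        # integer part of alpha**k
        if value > d_max:
            break
        if not grid or value != grid[-1]:
            grid.append(value)         # skip repeated integer parts
        k += 1
    return grid
```

Evaluating the empirical and theoretical functions only at these points keeps the number of summands manageable and weights each order of magnitude of the degree about equally.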



1 To obtain the graph, please see http://events.yandex.ru/events/publications/ or contact the authors. 
We use the non-linear least squares method to minimize the overall deviation between empirical and
theoretical functions over points from $A$. Most authors fit the power law distribution to empirical
data using a plain linear regression in log-log scale. Problems with this approach and reasons why
it is not appropriate for fitting the degree distribution of a real graph have already been discussed
extensively [17]. In our case, we see the following additional reasons not to use this method.







Figure 1: Degree distribution of the web host graph (for each degree, the number of vertices having 
this degree is represented with a black circle) and approximation using linear regression (green) in 
logarithmic scale. 




Figure 2: (a) Squared deviation between cumulative degree distribution and approximation using
our method. (b) Squared deviation between square roots of cumulative degree distribution and
approximation using our method. In both cases, linear regression for the range $[10^{2.9}, 10^{5.9}]$ is
shown for convenience.

1) Empirical argument: Fig. 1 illustrates that the linear regression for $\log(\#_{Host}(d))$ is a poor
approximation.

2) It can be shown that (assuming the Buckley-Osthus model) the variance and the mean of
$\tilde{\#}_{Host}(d)$ have the same order of growth as $d$ grows, and therefore the variance of $\sqrt{\tilde{\#}_{Host}(d)}$
can be bounded by a constant. This means that the following objective function

$$\frac{1}{|D_1|} \sum_{d \in D_1} \left( \sqrt{\tilde{\#}_{Host}(d)} - \sqrt{f_{a_1, b_1}(d)} \right)^2, \qquad (8)$$

where $D_1 \subset A$ is a certain range of degrees, is much more appropriate for optimization, as in
this case contributions of different summands are better calibrated. To illustrate the validity
of this argument empirically, we plot the distribution of $\tilde{\#}_{Host}(d) - f_{a_1, b_1}(d)$ (Fig. 2(a)) and
$\sqrt{\tilde{\#}_{Host}(d)} - \sqrt{f_{a_1, b_1}(d)}$ (Fig. 2(b)) for the parameters $a_1, b_1$ estimated in Section 3.4.



3) Linear regression is not applicable to the estimation of the parameters $a_2, b_2$, since the function $g$
cannot be represented in a linear form. Moreover, we are again able to show that the variance
and the mean of the empirical probability of an edge are of the same order.
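
Argument 2) relies on the variance-stabilizing effect of the square root. The following minimal Python sketch illustrates this effect under a simplifying assumption that is not made in the paper: the counts are drawn from a Poisson distribution, for which the variance equals the mean. The helper names `poisson` and `sqrt_variance` are illustrative only.

```python
import math
import random
import statistics


def poisson(rng, lam):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sqrt_variance(rng, lam, n_samples):
    """Sample variance of sqrt(X) for X ~ Poisson(lam)."""
    roots = [math.sqrt(poisson(rng, lam)) for _ in range(n_samples)]
    return statistics.variance(roots)


if __name__ == "__main__":
    rng = random.Random(0)
    for lam in (20, 200):
        raw = statistics.variance([poisson(rng, lam) for _ in range(4000)])
        # The raw variance grows with the mean; the square-root one does not.
        print(lam, round(raw, 1), round(sqrt_variance(rng, lam, 4000), 3))
```

Under this assumption the variance of the raw counts grows with the mean, while the variance of their square roots should stay close to 1/4, so the summands of a square-root objective contribute on a comparable scale.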

In accordance with 3), we estimate the parameters $a_2, b_2$ by minimizing the following objective function

$$\sum_{(i,j) \in D_2} \left( \sqrt{p_{\mathrm{Host}}(i,j)} - \sqrt{g_{a_2,b_2}(i,j)} \right)^2, \qquad (9)$$

where $D_2 = \{(d_1, d_2) : d_1, d_2 \in D_1,\ d_1/d_2 > 10\}$, and $D_1$ is the degree range chosen for estimating $a_1, b_1$ (to
be determined later). Note that we introduce the restriction on $d_1/d_2$ in accordance with Theorems
3 and 4, which give a good estimate of $p$ only for sufficiently large $d_1/d_2$ (see Section 2.2.2). The
threshold value $C = 10$ was chosen manually.

We minimize the objective functions (8) and (9) using the Gauss-Newton algorithm for non-linear
least squares optimization problems (e.g., see [2H]). Varying the degree range $D_1$, we examined
the product of the resulting optimized objectives (8) and (9) and chose $D_1 = [10^{2.9}, 10^{5.9}]$ as the range
of length 3 (in the logarithmic scale) with the minimal value of the product. The choice of the
degree range for our approximations is further justified empirically by our observations on the
deviations (Fig. 2). For ranges of larger lengths, the optimized product of objectives starts to grow
substantially.
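
The minimization of an objective such as (8) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the model $f(d) = b\,d^{-(1+a)}$ is a hypothetical stand-in for $f_{a_1,b_1}$ (whose exact form is defined earlier in the paper), the parameter $b$ is handled through its logarithm to keep it positive, and a step-halving safeguard is added to the plain Gauss-Newton step.

```python
import math


def sqrt_model(a, log_b, d):
    """Square root of the assumed model f(d) = b * d**-(1 + a)."""
    return math.exp(0.5 * (log_b - (1.0 + a) * math.log(d)))


def objective(a, log_b, degrees, counts):
    """Sum of squared differences of square roots, in the spirit of (8)."""
    return sum((math.sqrt(y) - sqrt_model(a, log_b, d)) ** 2
               for d, y in zip(degrees, counts))


def gauss_newton(degrees, counts, a, log_b, max_iter=100, tol=1e-12):
    """Damped Gauss-Newton iterations for the two parameters (a, log b)."""
    for _ in range(max_iter):
        s11 = s12 = s22 = g1 = g2 = 0.0
        for d, y in zip(degrees, counts):
            m = sqrt_model(a, log_b, d)
            r = math.sqrt(y) - m            # residual
            j1 = -0.5 * m * math.log(d)     # d(model) / d(a)
            j2 = 0.5 * m                    # d(model) / d(log b)
            s11 += j1 * j1
            s12 += j1 * j2
            s22 += j2 * j2
            g1 += j1 * r
            g2 += j2 * r
        # Solve the 2x2 normal equations (J^T J) step = J^T r.
        det = s11 * s22 - s12 * s12
        step_a = (s22 * g1 - s12 * g2) / det
        step_b = (s11 * g2 - s12 * g1) / det
        # Step halving (a safeguard added in this sketch).
        current, t = objective(a, log_b, degrees, counts), 1.0
        while t > 1e-8 and objective(a + t * step_a, log_b + t * step_b,
                                     degrees, counts) > current:
            t *= 0.5
        a, log_b = a + t * step_a, log_b + t * step_b
        if abs(t * step_a) + abs(t * step_b) < tol:
            break
    return a, log_b
```

On noise-free synthetic data generated from the assumed model over the degree range $[10^{2.9}, 10^{5.9}]$, the iterations are expected to recover the planted parameters from a nearby starting point; on real data the result depends on the starting point and on the chosen degree range.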

In the next subsection we describe the results of our experiments. 

3.4 Estimation for Empirical Cumulative Degree and Edge Distributions 

In this section we discuss the results of the two estimation methods for the parameter a described
in Section 3.3.

Table 1 shows the estimate of $a_2$ that we obtained by deriving the best fit of $g_{a_2,b_2}$ to the empirical
conditional probability $p_{\mathrm{Host}}(d_1, d_2)$ that a pair of vertices $v_1, v_2$ forms an edge in the web host
graph given that $\max(\deg(v_1), \deg(v_2)) > \max(d_1, d_2)$ and $\min(\deg(v_1), \deg(v_2)) > \min(d_1, d_2)$ (see
Equation (5) for the definition).
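
The empirical conditional probability described above can be computed directly on a small graph. The sketch below follows only the verbal description in the preceding paragraph (Equation (5) itself is not reproduced here) and assumes a simple undirected graph given as an edge list; the function name `conditional_edge_probability` is illustrative.

```python
from collections import Counter
from math import comb


def conditional_edge_probability(edges, d1, d2):
    """Fraction of qualifying vertex pairs that are joined by an edge.

    A pair qualifies when its larger degree exceeds max(d1, d2) and its
    smaller degree exceeds min(d1, d2). Edges are undirected pairs (u, v)
    of a simple graph (no loops, no repeated edges).
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    hi, lo = max(d1, d2), min(d1, d2)

    joined = sum(
        1 for u, v in edges
        if max(degree[u], degree[v]) > hi and min(degree[u], degree[v]) > lo
    )

    n_lo = sum(1 for deg in degree.values() if deg > lo)
    n_hi = sum(1 for deg in degree.values() if deg > hi)
    # Pairs with both degrees above lo, minus those with neither above hi.
    pairs = comb(n_lo, 2) - comb(n_lo - n_hi, 2)
    return joined / pairs if pairs else 0.0
```

For the path 0-1-2-3 with thresholds (1, 0), five vertex pairs qualify and three of them are edges, so the function returns 3/5.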

parameter    degree distribution    edge distribution
$a$          $a_1 = 0.2762$         $a_2 = 0.2774$
$\sigma$     2.631                  0.0599
             0.005666               $8.518 \cdot 10^{-6}$

Table 1: Results of the approximation of the cumulative distribution for degrees from the interval
$D_1 = [10^{2.9}, 10^{5.9}]$ and for edges between vertices with degrees from the domain $D_2$ for the web host
graph (see Sections 3.1 and 3.3 for details).

Figure 3: Cumulative degree distribution (black) and its approximation using our method (red) in
logarithmic scale; the horizontal axis shows the degree. For comparison, the result of the
approximation using linear regression in log-log coordinates is shown (green).

We measure the accuracy of the estimation of $a_2$ by bootstrapping in the following
way. We sample a set of edges of the same size as the original one, choosing each edge uniformly at
random from the collection of all edges, with replacement. For each sample, we substitute the
empirical function $X_{\mathrm{Host}}(d_1, d_2)$ with that for the sampled set, recompute $p_{\mathrm{Host}}(d_1, d_2)$ according to
Equation (5), and apply the estimation method described in Section 3.3. Applying the described
procedure 1000 times independently, we obtain one estimate of $a_2$ for each edge sample. The
normalized sum of squared deviations between these 1000 estimates and the one obtained from the
initial dataset is also reported in Table 1. We denote the normalized sum of squared deviations
between $g_{a_2,b_2}(d_1, d_2)$ and $p_{\mathrm{Host}}(d_1, d_2)$ in the domain $D_2$ by $\sigma^2$.
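
The bootstrap procedure described above can be sketched as follows. This is a schematic illustration only: `estimator` stands for the whole fitting procedure (any function mapping an edge list to a number), and the normalization by the number of resamples is an assumption about what "normalized" means here.

```python
import random


def bootstrap_deviation(edges, estimator, n_resamples=1000, seed=0):
    """Mean squared deviation of bootstrap estimates from the original one.

    `estimator` is any function mapping a list of edges to a number; in the
    setting above it would wrap the whole fitting procedure.
    """
    rng = random.Random(seed)
    reference = estimator(edges)
    total = 0.0
    for _ in range(n_resamples):
        # Same number of edges as the original, drawn with replacement.
        sample = rng.choices(edges, k=len(edges))
        total += (estimator(sample) - reference) ** 2
    return reference, total / n_resamples
```

A toy estimator, such as the fraction of edges incident to a fixed vertex, is enough to exercise the function; plugging in the actual fit of $a_2$ would reproduce the scheme of the paper at a much higher computational cost.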

For the chosen range of degrees $D_1 = [10^{2.9}, 10^{5.9}]$, we obtain $a_2 = 0.2774$ and $b_2 = 8.331 \cdot 10^{-4}$.
The results of this approximation are shown in Fig. 4. We observe a very good fit of the approximation
to the data. Note that, due to the theoretically predicted term $(d_1 + d_2)^{1-a}$, the approximation
was even able to capture a concave area around the diagonal $d_1 = d_2$. This would not be possible
with a simpler approximation of the form $d_1^{\alpha} d_2^{\beta}$.

The estimate of $a_1$ that we obtained by approximating $\#\mathrm{Host}$ by the function $f_{a_1,b_1}$
in the range of degrees $D_1$ is also shown in Table 1. We again measure the estimation accuracy
using bootstrapping: we sample with replacement 1000 sets of vertices, apply our method to the
corresponding degree distributions, and obtain 1000 estimates. The normalized sum of
squared deviations between $f_{a_1,b_1}(d)$ and $\#\mathrm{Host}(d)$ is denoted by $\sigma^2$ as well.

We want to stress that, surprisingly, we obtained nearly the same value $a \approx 0.27$ of the parameter
when approximating the degree and the edge distributions independently. This gives two independent
pieces of evidence that the Buckley-Osthus model fits the web host graph well. In the next section,
we further support this claim by comparing it with other models.
Figure 4: Cumulative edge distribution (blue) and its approximation using our method (green) in
logarithmic scale (axes are labeled with $\log_{10}$ of the values; the pictures differ only in the view angle).



4 Experiments on Simulated Graphs 

Here we describe the results of our experiments with graphs artificially generated in various random 
graph models. We have two goals: to demonstrate that for a random graph with the power law 
degree distribution the probability of an edge between vertices of given degrees is not determined 
by the exponent in the power law, and to show that the Buckley-Osthus model has the best 
approximation to the web host graph as compared with other models. 

First of all, we generate ten samples of Buckley-Osthus (BO) random graphs with 86.8M
vertices, a = 0.276, and m = 12 (close to the ratio of the number of edges to the number
of vertices observed in the actual web host graph). The cumulative degree and edge distributions
of one resulting graph are shown in Fig. 7 and Fig. 5, respectively, in comparison with those for the
web host graph. In both cases, we observe a strong fit, recapitulating the results from Section 3.4
(compare with Fig. 4).
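
For readers who wish to experiment at a small scale, the generation step can be sketched as follows. The sketch assumes the commonly used formulation of the Buckley-Osthus model, in which the edge added at step t points to an earlier vertex with probability proportional to its in-degree plus a, or forms a loop with weight a, and groups of m consecutive vertices are merged afterwards; the precise definition used in this paper is given in its earlier sections and may differ in details. The function name `buckley_osthus` is illustrative.

```python
import random


def buckley_osthus(n, m, a, seed=0):
    """Return the edge list of a small Buckley-Osthus-style multigraph.

    A graph with n*m vertices and one out-edge per vertex is grown first;
    consecutive groups of m vertices are then merged into single vertices.
    """
    rng = random.Random(seed)
    targets = []  # targets[s] is the endpoint chosen by vertex s
    for t in range(1, n * m + 1):
        # The total weight (a + 1) * t - 1 splits into an in-degree part
        # (t - 1, one unit per earlier edge) and a uniform part (a * t,
        # weight a for each of the t vertices, including the new one).
        if rng.random() * ((a + 1) * t - 1) < t - 1:
            target = rng.choice(targets)   # proportional to in-degree
        else:
            target = rng.randrange(t)      # uniform over vertices 0..t-1
        targets.append(target)
    # Merge vertices k*m, ..., k*m + m - 1 into vertex k.
    return [(source // m, target // m)
            for source, target in enumerate(targets)]
```

Sampling a uniformly random earlier target reproduces the in-degree-proportional choice without maintaining explicit weights, which keeps each step at constant cost. Loops and multiple edges are kept in the returned list.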

Fig. 6 compares the function $d_{nn}$, the average degree of a neighbor, for the web host graph and
for a sample generated in the Buckley-Osthus model with a = 0.276, which corresponds to the best
approximation by the model. As expected, the two distributions are very close to each other.
Interestingly, even the fluctuations of the two are very similar.
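
The quantity compared here can be computed from an edge list as sketched below. The averaging convention (first over the neighbors of each vertex, then over all vertices of the same degree) is an assumption, since the exact definition of $d_{nn}$ is not restated in this section; the function name is illustrative.

```python
from collections import defaultdict


def average_neighbor_degree(edges):
    """Map each degree k to the mean neighbor degree of degree-k vertices."""
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    degree = {v: len(adj) for v, adj in neighbors.items()}

    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, adj in neighbors.items():
        # Mean degree over the neighbors of v, accumulated per degree class.
        sums[degree[v]] += sum(degree[w] for w in adj) / len(adj)
        counts[degree[v]] += 1
    return {k: sums[k] / counts[k] for k in sorted(sums)}
```

For a star with three leaves, the leaves (degree 1) have an average neighbor degree of 3 and the center (degree 3) has an average neighbor degree of 1.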

In addition to the Buckley-Osthus model, we consider two other random graph models: the 
configuration model (GDS) and the Holme-Kim model (HK). 
The first model chooses a graph uniformly at random from all graphs with a specified fixed degree
sequence [5]. For our experiment, we generate a sequence of 86.8M numbers following the power law
distribution with the exponent $-2.276$ and use this sequence as the degree sequence in the model.
Then we generate five samples of random graphs in this model with 86.8M vertices and 128M edges
using a simple simulation algorithm [5]. The degree distribution of the resulting graph follows the
power law by construction.
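
A small-scale version of this construction can be sketched with the standard stub-matching procedure. This is an assumption made for illustration: the simulation algorithm of [5] may differ, and stub matching produces a multigraph that can contain loops and multiple edges. The degree sampler, the cap `max_degree`, and the parity fix are likewise choices of this sketch.

```python
import random


def power_law_degrees(n, exponent, rng, max_degree):
    """Sample n degrees with P(D >= k) = k**-(exponent - 1), capped."""
    degrees = []
    for _ in range(n):
        u = 1.0 - rng.random()  # uniform on (0, 1]
        degrees.append(min(int(u ** (-1.0 / (exponent - 1.0))), max_degree))
    if sum(degrees) % 2:        # stub matching needs an even total
        degrees[0] += 1
    return degrees


def configuration_model(degrees, rng):
    """Pair up degree stubs uniformly at random (loops may occur)."""
    stubs = [v for v, deg in enumerate(degrees) for _ in range(deg)]
    rng.shuffle(stubs)
    return list(zip(stubs[0::2], stubs[1::2]))
```

By construction every vertex ends up with exactly its prescribed degree (a loop counts twice), so the degree distribution of the sample is fixed by the input sequence regardless of how the stubs are paired.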

The second model is based on the idea of preferential attachment with triad formation steps
in the graph construction process [28]. We generate nine samples of random graphs with 86.8M
vertices and 1B edges. The degree distribution of the resulting graph follows the power law with the
exponent $-3$.

The degree and the edge distributions for a single sample from each of the two models, in comparison
with those for the web host graph, are shown in Fig. 7 and Fig. 8, respectively.

For each of the simulated graphs, we apply exactly the same two approximation procedures as
described in Section 3.3 and previously applied to the web host graph. Table 2 shows the results:
v and e are the numbers of vertices and edges in the sample graphs, and $a_1$ and $a_2$ are the parameters
of the best fit for the degree and edge distributions, respectively. Note that the algorithm diverges
for the edge distribution approximation of the HK model, so the value of $a_2$ is not defined in this
case. We also show the standard deviation of the obtained estimates of $a_1$ and $a_2$ over the several
samples of each model. The GDS model has a fixed degree distribution, which always results in the
same estimate of $a_1$.

Not surprisingly, the approximation algorithm extracts the parameters $a_1$ and $a_2$ planted in the
sample of the BO model with high accuracy, as the underlying assumption of this algorithm is
that the graph is generated by the Buckley-Osthus model.




Figure 5: Cumulative edge distributions for the web host graph (blue) and for the Buckley-Osthus
simulated graph (cyan) in logarithmic scale (axes are labeled with $\log_{10}$ of the values; the pictures
differ only in the view angle).

Although all generated graphs have the power law degree distribution, only the Buckley-Osthus 
graph has the edge distribution close to that observed in the real web host graph. 
Figure 6: Average degree of a neighbor of a vertex depending on the degree of this vertex for the
real web host graph (black) and a sample generated by the Buckley-Osthus model (green) with
a = 0.276 (corresponding to the best approximation).



model    parameters                                        estimates
BO       $v = 8.68 \cdot 10^{7}$, $e = 1.04 \cdot 10^{9}$    $a_1 = 0.289 \pm 0.0033$, $a_2 = 0.274 \pm 0.0038$
GDS      $v = 8.68 \cdot 10^{7}$, $e = 1.26 \cdot 10^{8}$    $a_1 = 0.29 \pm 0$, $a_2 = 1.053 \pm 0.00048$
HK       $v = 8.68 \cdot 10^{7}$, $e = 1.04 \cdot 10^{9}$    $a_1 = 1.06 \pm 0.0088$, $a_2$ = n/a

Table 2: Results of the approximation of the cumulative distributions of degrees from the interval
$D_1 = [10^{2.9}, 10^{5.9}]$ and of edges between vertices with degrees from the interval $D_1$ for the generated
graphs (see Sections 3.1 and 3.3 for details). The numbers of vertices and edges in the graphs are shown as
v and e, respectively. The results of the approximation using the method described in Section 3.3 are
shown as $a_1$ and $a_2$.




Figure 7: Cumulative degree distributions for the web host graph (blue), the BO simulated graph 
(cyan), the GDS simulated graph (red), and the HK simulated graph (orange) in logarithmic scale. 

5 Conclusion 



Figure 8: Cumulative edge distributions for the web host graph (blue), the GDS simulated graph
(red), and the HK simulated graph (orange) in logarithmic scale. Pictures for the GDS model differ
only in the view angle.

In this paper we study the degree and edge distributions of the web host graph. We compare them
with the Buckley-Osthus model of random graphs and find that the model agrees with the real
data. More precisely, we use two different approaches to estimate the initial attractiveness
parameter a, assuming that the web host graph is generated in the Buckley-Osthus model. In two
independent attempts, we compare the distribution of the number of edges between vertices with
respect to their degrees and the degree distribution in the real graph with theoretical predictions
for the Buckley-Osthus model. The values of a obtained with the two methods are very close to each
other, and therefore we conclude that the web host graph is very similar to the Buckley-Osthus
random graph with this particular value of a.

Besides being interesting on their own, our results may potentially be related
to real-world problems of practical interest.

One example of such a relation may be the work of Y. Lu et al., who made use of the power
law degree distribution in the webgraph and proposed the algorithm PowerRank, an improvement
over PageRank. We may expect that further empirical and theoretical studies of graphs representing
the Internet will help progress in other tasks related to search, in particular ranking
and crawling.

It has been argued that the web contains many communities, that is, sets of pages or hosts
characterized in particular by an abnormally high density of links between them [25], [29]. In
this respect, understanding how edges are distributed in the graph may potentially be useful for
algorithms detecting and testing such communities, as it provides a better description of the expected
background that prospective communities may be compared against. We expect that theoretical
and empirical results in the direction presented in this paper may prove useful for these problems.

One can imagine many directions for future work related to our results, both theoretical
and practical.

It would be interesting to continue to study the Buckley-Osthus random graph model, as well as
other models, and to extend theoretical knowledge of their properties. For the first time we described
the distribution of edges between vertices given their degrees in a real Internet graph. It would now be
interesting to compare different models with respect to this property, and our techniques may be
useful for this.

Even though we showed a good agreement of the model with real data, we had to simplify
the data in certain important aspects. It would be interesting to generalize existing random graph
models, or possibly to develop new ones, that could model graphs closer to reality: with multiple
edges, directed, hierarchical, and dynamically evolving with time. In particular, the clustering coefficient
of a Buckley-Osthus graph still differs significantly from that of the real graph. However, some
aspects of the Buckley-Osthus model may be promising.

It would definitely be interesting to develop and test the aforementioned and similar ideas of
applications to ranking, crawling, and community detection. We strongly believe that deeper and
broader theoretical results on models of Internet graphs, coupled with empirical observations of
certain characteristics of such real graphs, may lead to practical applications and insights.
A Proof of Proposition 1

We can estimate the expectation of the number of loops in the following way: 



$$\mathbb{E}N(\text{loops in } H_{a,m}^{n}) = O\left(\sum_{i=1}^{n} \frac{1}{i}\right) = O(\ln n).$$
To estimate the number of multiple edges we should take into account that we have no vertices of
degrees greater than 2mn in $H_{a,m}^{n}$. Also (using the same ideas as in the proof of Theorem 3) it can
be shown that E# a (d,i) = O (^r-). Therefore 

£iV(multiple edges in H% m ) = 



EE*'*"' w 

i=i d=i 



n 2mi . , 2 % 



,i=l d=l 



B Proof of the theoretical approximation in Equation (7) 

Here we prove the theoretical approximation from Equation (7) for the empirical conditional
probability $p_{\mathrm{Host}}(d_1, d_2)$.

First, for sufficiently large d\/d2, we obtain the following approximate formula using the esti- 
mations (§ and @: 



~ , , , v b 2E^ j , l >d 1 . j >d 2 ( i + j) la2 ( i j)" 

PKost(dl,d 2 ) « 2 ^ -2Z^7 

°il^i>d 1 l l^j>d 2 3 



(10) 



For d\jd 2 large enough, the numerator of the right-hand side of (10) equals 

i>d 1 ^j>d 2 i^j>d\ 

ci(di + d 2 ) 1 - a (d 1( i 2 )- 1 + cs^)- 1 -" « ci(di + ck) 1 """^!^)" 1 



for some constants $c_1$, $c_2$. Estimating the denominator of the right-hand side of (10) by $c\,(d_1 d_2)^{-1-a_2}$,
we get $p_{\mathrm{Host}}(d_1, d_2) \approx g_{a_2,b_2}(d_1, d_2)$, where $g_{a_2,b_2}$ is defined by (7).